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Abstract 

Maximum A Posteriori inference in graphical models is often solved via message-passing 
algorithms, such as the junction-tree algorithm, or loopy belief-propagation. The exact 
solution to this problem is well known to be exponential in the size of the model's maximal 
cliques after it is triangulated, while approximate inference is typically exponential in the 
size of the model's factors. In this paper, we take advantage of the fact that many models 
have maximal cliques that are larger than their constituent factors, and also of the fact that 
many factors consist entirely of latent variables (i.e., they do not depend on an observation). 
This is a common case in a wide variety of applications, including grids, trees, and ring- 
structured models. In such cases, we are able to decrease the exponent of complexity for 
O | message-passing by 0.5 for both exact and approximate inference. 



1 Introduction 



It is well-known that exact inference in tree- structured graphical models can be accomplished 
efficiently by message-passing; ope r ations following; a s i mple protocol making use of the distribu- 
tive law (jAii and McEliecd . Hoool : iKschischang et all . l200lh . It is also well-known that exact 
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inference in arbitrary graphical models can be solved by the junction-tree algorithm; its effi- 
ciency is determined by the size of the maximal cliques after triangulation, a quantity related 
to the treewidth of the graph. 

Figure [T] illustrates an attempt to apply the junction-tree algorithm to some graphical models 
containing cycles. If the graphs are not chordal ((a) and (b)), they need to be triangulated, 
or made chordal (red edges in (c) and (d)). Their clique-graphs are then guaranteed to be 
jun ction-trees, and th e distr ibutive law can be applied with the same protocol used for trees; 



Aji and McEliecd (120001 ) for a beautiful tutorial on exact inference in arbitrary graphs. 



see 

Although the models in this example contain only pairwise factors, triangulation has increased 
the size of their maximal cliques, making exact inference substantially more expensive. Hence 
approximate solutions in the original graph (such as loopy belief-propagation, or inference in a 
loopy factor-graph) are often preferred over an exact solution via the junction-tree Algorithm. 

Even when the model's factors are the same size as its maximal cliques, neither exact nor 
approximate inference algorithms take advantage of the fact that many factors consist only 
of latent variables. In many models, those factors that are conditioned upon the observation 
contain fewer latent variables than the purely latent cliques. Examples are shown in Figure [2 
This encompasses a wide variety of models, including grid-structured models for optical flow 
and stereo disparity as well as chain and tree-structured models for text or speech. 

In this paper, we exploit the fact that the maximal cliques (after triangulation) often have 
potentials that factor over subcliques, as illustrated in Figure [TJ We will show that whenever 
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julian . mcauley@nicta . com . au. 



1 



o — o — o 
o — 6 — o 



6 — 6 — 6 

(a) 




Figure 1: The models at left ((a) and (b)) can be triangulated ((c) and (d)) so that the junction- 
tree algorithm can be applied. Despite the fact that the new models have larger maximal cliques, 
the corresponding potentials are still factored over pairs of nodes only. Our algorithms exploit 
this fact. 




(a) 



(b) 





Figure 2: Some graphical models to which our results apply: cliques containing observations have 
fewer latent variables than purely latent cliques. White nodes correspond to the observation, gray 
nodes to the labeling. In other words, cliques containing a white node encode the data likelihood, 
whereas cliques containing only gray nodes encode priors. 

this is the case, the expected computational complexity of exact inference can be improved (both 
the asymptotic upper-bound and the actual runtime). 

Additionally, we will show that this result can be applied so long as those cliques that are 
conditioned upon an observation contain fewer latent variables than those cliques consisting of 
purely latent variables; the 'purely latent' cliques can be pre-processed offline, allowing us to 
achieve the same benefits as described in the previous paragraph. 

We show that these properties reveal themselves in a wide variety of real applications. Both 
of our improvements shall increase the class of problems for which inference via max-product 
belief-propagation is tractable. 

A core operation encountered in the junction-tree algorithm is that of finding the index that 
chooses the largest product amongst two lists of length N: 

i = argmax {v a [i] x v 6 [i]} . (1) 
ie{i..JV} 

Our results stem from the realization that while (eq. []]) appears to be a linear time operation, 
it can be decreased to 0(y/~N) (in the expected case) if we know the permutations that sort v a 
and Vfr. 

A preliminary version of this work appeared in iMcAuley and Caetanol fl2010h . 
1.1 Summary of Results 

A selection of the results to be presented in the remainder of this paper can be summarized as 
follows: 

• We are able to lower the asymptotic expected running time of the max-product belief- 
propagation for any graphical model whose cliques factorize into lower-order terms. 
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• The results obtained are exactly those that would be obtained by the traditional version 
of the algorithm, i.e., no approximations are used. 

• Our algorithm also applies whenever cliques containing an observed variable contain fewer 
latent variables than purely latent cliques, as in Figure [2] (meaning that certain computa- 
tions can be taken offline). 

• For any cliques composed of pairwise factors, we obtain an expected speed-up of at least 
tt(y/~N) (assuming N states per node; ft denotes an asymptotic lower-bound). 

• For example, in models with third-order cliques containing pairwise terms, message-passing 
is reduced from Q(N 3 ) to 0(N 2 \fN), as in Figure [T](d) . For models containing pairwise 
(but purely latent) cliques, message-passing is reduced from Q(N 2 ) to 0(N\/N), as in 
Figure El 

• For cliques composed of K-ary factors, the expected speed-up generalizes to at least 
Q(±Nk), though it is never asymptotically slower than the original solution. 

• The expected-case improvement is derived under the assumption that the order-statistics 
of different factors are independent. 

• If the different factors have 'similar' order statistics, the performance will be better than 
the expected case. 

• If the different factors have 'opposite' order statistics, the performance will be worse than 
the expected case, but is never asymptotically more expensive than the traditional version 
of the algorithm. 

Our results do not apply for every semiring £(+,•), but only to those whose 'addition' 
operation defines an order (for example, min or max); we also assume that under this ordering, 
our 'multiplication' operator satisfies 

a<bAc<d^>a-c<b-d. (2) 

Thus our results certainly apply to the max-sum and min-sum semirings (as well as max-product 
and min-product, assuming non-negative potentials), but not for sum-product (for example). 
Consequently, our approach is useful for computing MAP-states, but cannot be used to compute 
marginal distributions. We also assume that the domain of each node is discrete. 

We shall initially present our algorithm as it applies to models of the type shown in Figure 
CD The more general (and arguably more useful) application of our algorithm to those models in 
Figure [2] shall be deferred until Section SJ where it can be seen as a straightforward generalization 
of our initial results. 



1.2 Related Work 



There has been previous work on speeding-up message-passin g algorithms b y expl oiting some 
type of structure in certain graphical models. For example, iKersting et al.l (120091) study the 
case w here different cliques share the same potential function. In lFelzenszwalb and Huttenlocher 
(120061 ) , fast message-passing algorithms are provided for cases in which the potential of a 2-clique 
is only dependent on the difference of the latent variables (which is common in some computer 
vision applications); the y also show how the algo rithm can be made faster if the graphical model 
is a bipartite graph. In Kumar and Torr ((2006), the a uthors provide f aster algorithms for the 
case in which the potentials are truncated, whereas in Petersen et al. (120081 ) the authors offer 
speed-ups for models that are specifically grid-like. 
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The latter work is perhaps the most similar in spirit to ours, as it exploits the fact that 
certain factors can be sorted in order to reduce the search space of a certain maximization 
problem. In practice, this leads to linear sp eed-ups over a 0(jV 4 ) algo rithm. 

Another closely related paper is that of Park and Darwichel (|2003l ). This work can be seen 
to compliment ours in the sense that it exploits essentially the same type of factorization that 
we study, though it applies to su m-product versio ns of the algorithm, rather than the max- 
product version that we shall study. IKjaerulffl (119981 ) also exploits factorization within cliques of 
junction-trees, albeit a different type of factorization than that studied here. 

In Section El we shall see that our algorithm is closely related to a well-studied problem known 
as 'funny matrix multiplication' ( Kerr . 1970l ). The worst-case complexity of this problem has 



been studied in relation to the all-pairs shortest path problem (lAlon et all 119971 ; iKarger et al. 
1991). 



2 Background 

The notation we shall use is briefly defined in Table [TJ We shall assume throughout that the 
max-product semiring is being used, though our analysis is almost identical for any suitable 
choice. 

MAP-inference in a graphical model Q consists of solving an optimization problem of the 
form 

x = argmax $c(xc), (3) 
x cec 

where C is the set of maximal cliques in Q. This problem is often solved via message-passing 
algorit hms such as the junction - tree algorithm, loopy belief- propagation, or inference in a factor 
graph flAji and McEliecel . l2000l : IWeissl . I2OO0I : iKschischang et all l200lh . 

Two of the fundamental steps encountered in message-passing algorithms are defined below. 
Firstly, the message from a clique X to an intersecting clique Y is defined by 



rax^y(xxny) 



max I $x(xx) IT m z^x(*xnz) 

*X\Y 

v 1 zer(x)\Y 



(4) 



(where T(X) returns the neighbors of the clique X). If such messages are computed after Y has 
received messages from all of its neighbors except X (i.e., T(X) \ Y), then this defines precisely 
the update scheme used by the junction-tree algorithm. The same update scheme is used for 
loopy belief-propagation, though it is done iteratively in a randomized fashion. 

Secondly, after all messages have been passed, the MAP-states for a subset of nodes M 
(assumed to belong to a clique X) is computed using 



^m(xm) max < $i(x x ) TT m z ^x{^xnz) > • 

X X\M 1 

" 1 zer(x) 



(5) 



Often, the clique-potential ^x(^x) shall be decomposable into several smaller factors, i.e., 

$i(xi) = n <Mxf). (6) 

Fax 

Some simple m o tiva ting examples are shown i n Figur e E3 a model for pose estimation from 
Siga! and Blad] t00(j). a 'sk in-chain CRF' from IC.allcvl (120061 ) , and a model for shape matching 
from Coughlan and Ferreira ( 20021 ) . In each case, the triangulated model has third-order cliques, 
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Table 1: Notation 







A T-) 

A; B 


capital letters reler to sets ol nodes (or similarly, 




cliques) ; 


A I I D. A <~\ D. A \ ~D 
AU ID] A\\ ID, A \ ID 


standard set operators are used (A \ B denotes set dif- 




£ \ 

terence) ; 


aom^AJ 


the domain of a set; this is just the Cartesian product 




or the domains or each element m the set; 


P 


111 • i 11 j j £ j 

bold capital letters reler to arrays; 


X 


1111 i ii C i i 

bold lower-case letters refer to vectors; 


r 1 

x[aj 


vectors are indexed using square brackets; 


t» r i 

P[nJ 


•11 11j_ 1 j_ * 1 £ 

similarly, square brackets are used to index a row or a 




O 1 

z-d array, 


t» r i 
P[n] 


£ ( 1 1 i 1 \ 1 * * 1 

or a row or an ( n + lj-dimensional array; 




superscripts are just labels, i.e., P is an array, v is a 




vector; 




constant subscripts are also labels, i.e., if a is a constant, 




then v a is a constant vector; 


a^; x^4 


variable subscripts define variables; the subscript de- 




O j 1 1 • P j 1 • 1 1 

fines the domain of the variable; 


n|x 


it n is a constant vector, then n\x is the restriction of 




that vector to those indices corresponding to variables 




in X (assuming that X is an ordered set); 


$ A ;$ A (x A ) 


a function over the variables in a set A] the argument 




x^ will be suppressed if clear, given that 'functions' are 




essentially arrays for our purposes; 


^i,J (^5 ^j) 


a function over a pair of variables (x^,Xj); 


$A(n| B ;x A \ B ) 


if one argument to a function is constant (here n b), 


then it becomes a function over fewer variables (in this 




case, only ^a\b 1S free); 



but the potentials are only pairwise. Other examples have already been shown in Figure [TJ 
analogous cases are ubiquitous in many real applications. 

The optimizations we suggest shall apply to general problems of the form 



^m(xm) max I f $f(xf), (7) 

^ M FCX 

which subsumes both (eq. Hj) and (eq. [5]), where we simply treat the messages as factors of the 
model. Algorithm [T] gives the traditional solution to this problem, which does not exploit the 
factorization of This algorithm runs in 0(7Vl x l), where N is the number of states 

per node, and \X\ is the size of the clique X (we assume that for a given xj, computing 
Hfcx ^f(xf) takes constant time, as our optimizations shall not modify this cost). 
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(a) 



(b) 



(c) 



Figure 3: ( a) A m o del f or pose reconstruction from Sigal and Black ( 20061): (b) A 'skip-chain 
CRF' from Galley (120061 ) ; (c) A model for deformable matching from ICoughlan and Ferreira 
( 20021 ). Although the (triangulated) models have cliques of size three, their potentials factorize 
into pairwise terms. 



Algorithm 1 Brute-force computation of max-marginals 

Input: a clique X whose max-marginal 7tim(xm) (where M C X) we wish to compute; assume 
that each node in X has domain {1 . . . N} 
1: for m G dom(M) {i.e., {1 . . . N} |M| } do 
2: max := — oc 
3: for y e dom(X \ M) do 
4: if n^cx ^f(™|f; y\f) > Tnax then 
5: max := Yl FcX $f(™\f] y\f) 

6: end if 

7: end for {this loop takes Q(N\ X \ M \)} 

8: 771m (m) : = VflQjX 

9: end for {this loop takes 9(A^ |X| )} 
10: Return: ttim 
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3 Optimizing Algorithm U 



In order to specify a more efficient version of Algorithm [H we begin by considering the simplest 
possible nontrivial factorization: a clique of size three containing pairwise factors. In such a 
case, our aim is to compute 

m ii j(xi,Xj) = max$ij ik (xi,Xj,Xk), (8) 

x k 

which we have assumed takes the form 

mij(xi,Xj) = max$ij(xi,Xj) x ^^ k {x h x k ) x $ jjk (xj,x k ). (9) 

x k 

For a particular value of (xi,xj) = (a, 6), we must solve 

rriij(a,b) = $ij(a,b) x max$^(a,x fe ) x $ jik (b,x k ), (10) 

Xk N v / N v ' 

which we note is in precisely the form shown in (eq. []]) . 

There is a close resemblance between (eq. [TUl) and the problem of multiplying two matrices: 
if the 'max' in (eq. [TUl) is replaced by summation, we essentially recover traditional matrix 
multiplication. W hile traditional matrix multiplication is well known to have a sub-cubic worst- 
case solution (see IStrassenl . 19691 ). the version in (eq. [TUl) (often referred to as 'funny matrix 



multiplication', or simply 'max-product matrix multiplication') is known to be cubic in th e 
worst case, assuming that only multiplication and comparison operations are used (iKerr . 1970l ). 



The complexity of solving (eq. fTO]) can also be show n to be equivalent to the all-pairs shortest 
path problem, which is studied in klon et al.1 (jl997h . Not surprisingly, we shall not improve the 



worst-case complexity, but shall instead give far better expected- case performance than existing 
solutions. Just as Strassen's algorithm can be used to solve (eq. [TUl) when maximization is 
replaced by summation, there has been work studying the problem of s um-product inference 



in gr aphical models, subject to the same type of factorization we discuss (jPark and Darwiche 



2003|). 

As we have previously suggested, it will be possible to solve (eq. [TO]) efficiently if v a and 
are already sorted. We note that v a will be reused for every value of Xj, and likewise will be 
reused for every value of X{. Sorting every row of and $j k can be done in <d(N 2 log N) (for 
2 N rows of length N) . 

The following elementary lemma is the key observation required in order to solve (eq. [TUl) 
efficiently: 

Lemma 1. If the p th largest element ofv a has the same index as the q th largest element o/v&, 
then we only need to search through the p largest values of v a , and the q largest values o/ v&; 
any corresponding pair of smaller values could not possibly be the largest solution. 

This observation is used to construct Algorithm [2l Here we iterate through the indices start- 
ing from the largest values of v a and v&, stopping once both indices are 'behind' the maximum 
value found so far (which we then know is the maximum). This algorithm is demonstrated 
pictorially in Figure [H 

A prescription of how Algorithm [2] can be used to solve (eq. Ej) is given in Algorithm [3l 
Determining precisely the running time of Algorithm [2] (and therefore Algorithm [3j) is not trivial, 
and will be explored in depth in Appendix lAl We note that if the expected-case running time of 
Algorithm [2] is 0(f(N)), then the time taken to solve Algorithm [3] shall be 0(N 2 (log N+f(N))). 
At this stage we shall state an upper-bound on the true complexity in the following theorem: 

Theorem 1. The expected running time of Algorithmic is 0{\fN), yielding a speed-up of at 
least Q,(y/N) in cliques containing pairwise factors. 
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Algorithm 2 Find i such that v a [i] x v^[i] is maximized 

Input: two vectors v a and V5, and permutation functions p a and ^5 that sort them in decreasing 

order (so that v a [p a [l]] is the largest element in v a ) 
1: Initialize: start := 1, end a := p~ end^ := p^b^l]] {if end^, = k, then the largest 

element in v a has the same index as the k th largest element in V5} 
2: best :=p a [l], max := v a [&est] x v^&est] 
3: if v a [pb[l]] x v&[pb[l]] > max then 
4: frest :=pb[l], max := v a [6est] x v^&est] 
5: end if 

6: while start < end a {in practice, we could also stop if start < end^, but the version given 

here is the one used for analysis in Appendix lAl} do 
7: start := start + 1 

8: if v a [p a [start}] x Vfc[p a [start]] > max then 

9: frest := p a [start] 
10: max := v a [best] x v^&est] 
11: end if 

12: if p^" 1 [p a [start]] < end}) then 
13: end*, := p^ l [p a [stari\] 
14: end if 

15: {repeat Lines [HHHJ interchanging a and 6} 
16: end while {this takes expected time 0(y/~N)} 
17: Return: best 



Algorithm 3 Use Algorithm [2] to compute the max-marginal of a 3-clique containing pairwise 

factors 

Input: a potential ^^(a, 6, c) = $ij(a, 6) x $^(a,c) x $^(6, c) whose max-marginal 
rriij(xi,Xj) we wish to compute 
1: for n E {1 . . . iV} do 

2: compute P z [n] by sorting ^^(n, {takes 6(7Vlog iV)} 

3: compute P J [n] by sorting {P* and P J are N x N arrays, each row of which 

is a permutation; $^(71, and $^(71, x^) are functions over x^, since n is constant in 
this expression} 

4: end for {this loop takes 6(7V 2 log7V)} 

5: for (a, b) e {1...N} 2 do 

6: (v a , v 6 ) := (*i,A.(a, $^(6, a*)) 

7: (p a ,p 6 ):=(PMa] 5 P^'[&]) 

8: frest := AlgorithrrlQl^aiVbiPaiPb) {takes 0(\/iV)} 

9: rriij(a, b) := $ij(a, 6) x 3>;,/c(a, ftest) x $^(6, feest) 
10: end for {this loop takes 0(7V 2 \/iV)} 

{the total running time is 0(7V 2 log N + N 2 \fN), which is dominated by O(7V 2 \/A0} 
11: Return: rriij 
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Figure 4: Algorithm El explained pictorially. The arrows begin at p a [start] and pb[start]; the red 
line connects end a and end^, behind which we need not search; a dashed arrow is used when a 
new maximum is found. Note that in the event that v a and contain repeated elements, they 
can be sorted arbitrarily. 
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Figure 5: The reasoning applied in Algorithm [2] applies even when the elements of p a and p^ are 
multidimensional indices. 

3.1 An Extension to Higher-Order Cliques with Three Factors 

The simplest extension that we can make to Algorithms El and El is to note that they can be 
applied even when there are several overlapping terms in the factors. For instance, Algorithm [3] 
can be adapted to solve 

^iji^ii •Ej) ~ 3Cj) X $2,fc,m(^i5 ^/c? *^m) X Qjkm^Xj , Xfc , X m ) , (H) 

and similar variants containing three factors. Here both x& and x m are shared by <fr^,m an d 
&j,k : m- We can follow precisely the reasoning of the previous section, except that when we sort 
&i,k,m (similarly $j^,m) f° r a fixed value of we are now sorting an array rather than a vector 
(Algorithm El Lines El and E]) ; in this case, the permutation functions p a and p^ in Algorithm [2] 
simply return pairs of indices. This is illustrated in Figure El Effectively, in this example we 
are sorting the variable := x^ ® x m , which has state space of size TV 2 . 

As the number of shared terms increases, so does the improvement to the running time. 
While (eq. [TT]) would take 0(7V 4 ) to solve using Algorithm [H it takes only 0(N 3 ) to solve using 
Algorithm [3] (more precisely, if Algorithm El takes 0(f(N)), then (eq. E]) takes 0(N 2 f(N 2 )), 
which we have mentioned is 0(N 2 V N 2 ) = 0(N 3 )). In general, if we have S shared terms, 
then the running time is 0(N 2 V N s ), yielding a speed-up of fl(V N s ) over the naive solution of 
Algorithm [TJ 

3.2 An Extension to Higher-Order Cliques with Decompositions Into Three 
Groups 

By similar reasoning, we can apply our algorithm to cases where there are more than three 
factors, in which the factors can be separated into three groups. For example, consider the 
clique in Figure [6](a), which we shall call G (the entire graph is a clique, but for clarity we only 
draw an edge when the corresponding nodes belong to a common factor). Each of the factors 
in this graph have been labeled using either differently colored edges (for factors of size larger 
than two) or dotted edges (for factors of size two), and the max-marginal we wish to compute 
has been labeled using colored nodes. We assume that it is possible to split this graph into three 
groups such that every factor is contained within a single group, along with the max-marginal 
we wish to compute (Figure El (b)). If such a decomposition is not possible, we will have to 
resort to further extensions to be described in Section 13.31 

Ideally, we would like these groups to have size ~ |G|/3, though in the worst case they 
will have size no larger than |G| — 1. We call these groups X, Y, Z, where X is the group 
containing the max-marginal M that we wish to compute. In order to simplify the analysis 
of this algorithm, we shall express the running time in terms of the size of the largest group, 
S = max(|X|, \ Y |, |Z|), and the largest difference, S\ = max(|y |Z\X|). The max-marginal 
can be computed using Algorithm [H 

The running times shown in Algorithm [H are loose upper-bounds, given for the sake of 
expressing the running time in simple terms. More precise running times are given in Table El 
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(a) 



(b) 



(a) We begin with a set of factors (indicated using colored lines), which are assumed to belong to some clique in 
our model; we wish to compute the max-marginal with respect to one of these factors (indicated using colored 
nodes); (b) The factors are split into three groups, such that every factor is entirely contained within one of them 
(Algorithm [H line [J). 




(c) (d) (e) 

(c) Any nodes contained in only one of the groups are marginalized (Algorithm 2] lines [2] [3] and 2} ; the problem 
is now very similar to that described in Algorithm [3] except that nodes have been replaced by groups; note that 
this essentially introduces maximal factors in Y' and Z (d) For every value (a, b) £ dom(x3,X4), ^ Y (a,b,X6) is 
sorted (Algorithm^ lines[5HZl); ( e ) For every value (a, b) £ dom(x2,X4), ^ Z (a 1 b 1 xq) is sorted (Algorithm [4J lines 

MM- 




^ (f) (g) 

(f) For every n £ dom(X x ), we choose the best value of xq by Algorithm [2] (Algorithm |4] lines [TTHT6]) ; (g) The 
result is marginalized with respect to M (Algorithm [4] line I17|) . 



Figure 6: Algorithm [H explained pictorially. In this case, the most computationally intensive 
step is the marginalization of Z (in step (c)), which takes 0(7V 5 ). However, the algorithm 
can actually be applied recursively to the group resulting in an overall running time of 
0(N 4 y/N), for a max-marginal that would have taken 0(7V 8 ) to compute using the naive solution 
of Algorithm [TJ 
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Algorithm 4 Compute the max-marginal of G with respect to M, where G is split into three 
groups 

Input: potentials 3>g(x) = ^x(xj) x $y(xy) x $^(x^); each of the factors should be contained 
in exactly one of these terms, and we assume that M C X (see Figure [6j) 

1: Define: A' := ((Y U Z) n X) U M; Y' := (IUZ)H Y; Z' := (X U Y) n Z {X f contains 
the variables in X that are shared by at least one other group; alternately, the variables in 
X \ X' appear only in X (sim. for Y' and Z')} 

2: compute ^ x (pcx f ) '•— max x\x / ^i(xx) {we are marginalizing over those variables in X that 
do not appear in any of the other groups (or in M); this takes Q(N S ) if done by brute force 
(Algorithm [T]) , but may also be done by a recursive call to Algorithm (4)} 

3: compute $ y (xy/) := maxy\y/ $y(xy) 

4: compute ^ z (-xz') := max^^/ &z(*z) 

5: for n <E dom(A fl Y) do 

6: compute P y [n] by sorting * y (n;x y /^) {takes Q(S\N s \\ogN); * y (n;xy/^) is free 
over Xy/^j, and is treated as an array by 'flattening' it; P y [n] contains the \Y f \ X\ = 
|(Y fl Z) \ A | -dimensional indices that sort it} 
7: end for {this loop takes Q(S\N S log A)} 
8: for n e dom(A fl Z) do 
9: compute P z [n] by sorting ^ z (n;x Z /\ x ) 
10: end for {this loop takes @(S\N S log N)} 
11: for n e dom(A / ) do 

12: (v a ,v&) := (n|y/; Xy/\^/), ^ Z (p\z r i *-z'\X')) { n l^ ; * s the 'restriction' of the vector n 
to those indices in Y' (meaning that n|y/ E dom(A' HY')); hence (n|y/; Xy'\X') * s f ree 
in xy/\p, while n|y/ is fixed} 

13: (p a ,p 6 ):=(P y [n| y ,],P^[n|^]) 

14: best := Algorithrr^KyaiVbiPaiPb) {takes 0(y/#\)} 

15: mx(n) := * x (n) x ^ y (best; n|y/) x ^/ z (best; n\ z >) 

16: end for 

17: tum^m) '•— Algorithrr^mx, M) {i.e., we are using Algorithm [1] to marginalize mj(xx) 
with respect to M; this takes 6 (AT 5 ')} 
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Table 2: Detailed running time analysis of Algorithm^! any of these terms may be asymptotically 
dominant 



Description 



lines 



time 



Marginalization of <3>x, without recursion 
Marginalization of <3>y 
Marginalization of <f>z 
Sorting <E>y 
Sorting <3>^ 

Running Algorithm [2] on the sorted values 



El 

3 

CD 

EH3 



e(ivi A i) 
e(ivi^i) 



EM 
MM 



Q(\Y'\X\N\ Y '\\og N) 
e(\Z'\X\ N\ z '\]ogN) 
0(N\ x 'WN\( Y ' nz ')\ x '\) 



Graph: 





with pairwise 



{A complete 
graph K M , 



terms} 



Algorithm [TJ 
Algorithm 0J 
Speed-up: 



(a) 

e(A 5 ) 

0{N z ^fN) 



(b) 

e(A 3 ) 

0{N 2 \fN) 



(c) 
6(A n ) 
0{N 6 \TN) 
Cl(N 4 VN) 



(d) 

e(A 6 ) 

0(A 5 ) 
f2(A) 



Q(N M ) 
0(N 5M / 6 ) 
f2(A M / 6 ) 



Figure 7: Some example graphs whose max-marginals are to be computed with respect to the 
colored nodes, using the three regions shown. Factors are indicated using differently colored 
edges, while dotted edges always indicate pairwise factors, (a) is the region Z from Figure [6] 
(recursion is applied again to achieve this result); (b) is the graph used to motivate Algorithm 
02 (c) shows a query in a graph with regular structure; (d) shows a complete graph with six 
nodes; (e) generalizes this to a clique with M nodes. 



any of the terms shown in Table [2] may be dominant. Some example graphs, and their resulting 
running times are shown in Figure [3 



3.2.1 Applying Algorithm [4] Recursively 

The marginalization steps of Algorithm [4] (Lines El El and[4j) may further decompose into smaller 
groups, in which case Algorithm [H can be applied recursively. For instance, the graph in Figure 
[3(a) represents the marginalization step that is to be performed in Figure [6](c) (Algorithm [H 
Line[4j). Since this marginalization step is the asymptotically dominant step in the algorithm, 
applying Algorithm [4] recursively lowers the asymptotic complexity. 

Another straightforward example of applying recursion in Algorithm [4| is shown in Figure 
[HI in which a ring-structured model is marginalized with respect to two of its nodes. Doing 
so takes 0(MN 2 \fN); in contrast, solving the same problem using the junction-tree algorithm 
(by triangulating the graph) would take 6(MiV 3 ). Loopy belief-propagation takes 6(MiV 2 ) 
per iteration, meaning that our algorithm will be faster if the number of iterations is Q(yN). 
Naturally, Algorithm [3] could be applied directly to the triangulated graph, which would again 
take 0(MN 2 VN). 
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+ 




0(2N 2 VN) 



+ 




' 2 VN) (by Algorithm ED 



Figure 8: In the above example, lines [SHU of Algorithm [4| are applied recursively, achieving a 
total running time of 0(MN 2 \fN) for a loop with M nodes (our algorithm achieves the same 
running time in the triangulated graph). 

3.3 A General Extension to Higher-Order Cliques 

Naturally, there are cases for which a decomposition into three terms is not possible, such as 



(i.e., a clique of size four with third-order factors). However, if the model contains factors of 
size K, it must always be possible to split it into K+l groups (e.g. four in the case of (eq. [T2]) ). 

Our optimizations can easily be applied in these cases simply by adapting Algorithm [2] to 
solve problems of the form 



Pseudocode for this extension is presented in Algorithm [5l Note carefully the use of the variable 
read: we are storing which indices have been read to avoid re-reading them; this guarantees that 
our Algorithm is never asymptotically worse than the naive solution. Figure [9] demonstrates 
how such an algorithm behaves in practice. Again, we shall discuss the running time of this 
extension in Appendix lAl For the moment, we state the following theorem: 

Theorem 2. Algorithm^ generalizes Algorithm^ to K lists with an expected running time of 
0(KN k ) ; yielding a speed-up of at least fi(-^TV^) in cliques containing K-ary factors. It is 
never worse than the naive solution, meaning that it takes 0(min(7V, KN k )). 

Using Algorithm El we can similarly extend Algorithm [H to allow for any number of groups 
(pseudocode is not shown; all statements about the groups Y and Z simply become statements 
about K groups {G\ . . . Gk}, and calls to Algorithm [2] become calls to Algorithm [5j) . The one 
remaining case that has not been considered is when the sequences vi • • • are functions of 
different (but overlapping) variables; naively, we can create a new variable whose domain is the 




i — argmax {vi[i] x V2[i] x ••■ x v k[^]} • 
ie{l...N} 



(13) 



14 



Algorithm 5 Find i such that Yl k= i v&[i] is maximized 

Input: K vectors vi . . . v^; permutation functions p\ . . .px that sort them in decreasing order; 
a vector read indicating which indices have been read, and a unique value T ^ read {read is 
essentially a boolean array indicating which indices have been read; since creating this array 
is an O(N) operation, we create it externally, and reuse it O(N) times; setting read[i] = T 
indicates that a particular index has been read; we use a different value of T for each call 
to this function so that read can be reused without having to be reinitialized} 

1: Initialize: start := 1, 

max := max p6{pi ... PK} n£=i v kW}}, 
best := argmaXp G{pi .. w} Uk=i v kW]] 

2: for k e 1 ... K do 

3: end k := m&x qe{pi _ PK} p^Ml]] 

4: end for 

5: read[p[l}} = T 

6: while start < maxjenrfi . . . end^} do 

7: start := start + 1 

8: if read[p[start\] := T then 

9: continue 
10: end if 
11: read[p[start\] \— T 
12: m := maXp G{pi ...p K } nf=i v fe [p[5tart]] 
13: b := axgmaXp G{pi _ PK} nf=i v fc [p[start]] 
14: if m > max then 
15: 6est := b 
16: max := m 
17: end if 

18: for k e {1 ... do 

19: e k := maXg 6{pii>iPK } p^ 1 [g [start]] 

20: end for 

21: for fc G {1 . • . do 

22: enrf/e := min(e^, end^) 

23: end for 

24: end while {see Appendix lAl for running times} 
25: Return: best 
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Figure 9: Algorithm [2] can easily be extended to cases including more than two sequences. 
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product space of all of the overlapping terms, and still achieve the performance improvement 
guaranteed by Theorem [2j in some cases, better results can again be obtained by applying 
recursion, as in Figure [3 

As a final comment we note that we have not provided an algorithm for choosing how to 
split the variables of a model into (K + l)-groups. We note even if we split the groups in a naive 
way, we are guaranteed to get at least the performance improvement guaranteed by Theorem 
EJ though more 'intelligent' splits may further improve the performance. Furthermore, in all of 
the applications we have studied, K is sufficiently small that it is inexpensive to consider all 
possible splits by brute-force. 

4 Exploiting 'Data Independence 5 in Latent Factors 

While (eq. [3]) gave the general form of MAP-inference in a graphical model, it will often be 
more convenient to express our objective function as being conditioned upon some observation, 
y. Thus inference consists of solving an optimization problem of the form 

x(y) = argmax $c(xc|yc). (14) 
x CeC 

When our objective function is written in this way, further factorization is often possible, yielding 
an expression of the form 

x(y) = argmax \[ $ F (x F |y F ) x $ c (xc), (15) 
x FeF CeC 

V v ' V v ' 

data dependent data independent 

where each F E T is a subset of some C E C. We shall say that those factors that do not depend 
on the observation are 'data independent'. 

By far the most common instance of this type of model has 'data dependent' factors consisting 
of a single latent variable, and conditioned upon a single observation, and 'data independent' 
factors consisting of a pair of latent variables. This was precisely the class of models depicted 
at the beginning of our paper in Figure EJ whose objective function takes the form 

x(y) argmax n^to) x J\ ®hj( x ii x j) ( 16 ) 

node potential edge potential 

(where Af and £ are the set of nodes and edges in our graphical model). As in the Section El 
we shall concern ourselves with this version of the model, and explain only briefly how it can be 
applied with larger factors, as in Section 13.21 

Note that in (eq. [T6]) we are no longer concerned solely with exact inference via the junction- 
tree algorithm. In many models, such as grids and rings, (eq. [T6]) shall be solved approximately 
by means of either loopy belief-propagation, or inference in a factor graph. 

Given the decomposition of (eq. [T6]h message-passing now takes the form 

m A -+B{q) = Mq) x niax^fejx $ij(g,3/j) (17) 

(where A = (z, j) and B = (z, k)). Just as we made the comparison between (eq. [TUl) and matrix 
multiplication, we can see (eq. [T71) as being related to the multiplication of a matrix (<&ij) with a 
vector (<frj), again with summation replaced by maximization. Given the results we have already 
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shown, it is trivial to solve (eq. [TTD in 0(Ny/~N) if we know the permutations that sort <frj, and 
the rows of <f>ij. The algorithm for doing so is shown in Algorithm [61 The difficultly we face 
in this instance is that sorting the rows of <&^j takes 0(7V 2 log7V), i.e., longer than Algorithm [S] 
itself. 

This problem is circumvented due to the following simple observation: since <&ij(xi,Xj) 
consists only of latent variables (and not upon the observation), this 'sorting' step can take 
place offline, i.e., before the 'data' has been observed. 

Two further observations mean that even this offline cost can often be avoided. Firstly, 
many models have a 'homogeneous' prior, i.e., the same prior is shared amongst every edge (or 
clique) of the model. In such cases, only a single 'copy' of the prior needs to be sorted, meaning 
that in any model containing Q(logN) edges, speed improvements can be gained over the naive 
implementation. Secondly, where an iterative algorithm (such as loopy belief-propagation) is to 
be used, the sorting step need only take place prior to the first iteration; if Q(\ogN) iterations 
of belief propagation are to be performed (or indeed, if the number of edges multiplied by the 
number of iterations is f2(log iV)) 3 we shall again gain speed improvements even when the sorting 
step is done online. 

In fact, the second of these conditions obviates the need for data independence altogether. 
In other words, in any pairwise model in which fi(logiV) iterations of belief propagation are to 
be performed, the pairwise terms need to be sorted only during the first iteration. Thus these 
improvements apply to those models in Figure [U so long as the number of iterations is Jl(log N). 



Algorithm 6 Solve (eq. [T71) using Algorithm [2] 

Input: a potential <&ij(a,b) x <E>^(a|^) x <frj(b\yj) whose max-marginal rrii(xi) we wish to com- 
pute, and a set of permutation functions P such that P[i] sorts the i th row of (in 
decreasing order). 

1: compute the permutation function p a by sorting tyj {takes Q(N log N)} 

2: for q e {1 . . . N} do 

3: (v fl ,v 6 ) := (^j,^ij(q,Xj\yi,yj)) 

4: best := AlgorithrrfR(v a ,v b ,p a ,P[q]) {0(y/N)} 

5: rriA^B(q) '= ®i(q) x 9 j (best) x $ij(q, best\yi,yj) 

6: end for {expected- case 0(Ny/N)} 

7: Return: ttia^b 



4.1 Extension to Higher-Order Cliques 

Just as in Section 13.21 we can extend Algorithm [6] to factors of any size, so long as the purely 
latent cliques contain more latent variables than those cliques that depend upon the observation. 
The analysis for this type of model is almost exactly the same as that presented in Section 13.21 
except that any terms consisting of purely latent variables are processed offline. 

As we mentioned in 13. 2\ if a model contains (non-maximal) factors of size K , we will gain 
a speed-up of Q(-^Nk). If in addition there is a factor (either maximal or non-maximal) 

consisting of purely latent variables, we can still obtain a speed-up of ^}(^-^N K + 1 ), since this 
factor merely contributes an additional term to (eq. [T3l). Thus when our 'data-dependent' terms 
contain only a single latent variable (i.e., K — 1), we gain a speed-up of Q(y/N), as in Algorithm 

E 



17 



5 Performance Improvements in Existing Applications 



Our results are immediately compatible with several applications that rely on inference in graph- 
ical models. As we have mentioned, our results apply to any model whose cliques decompose 
into lower- order terms. 

Often, potentials are defined only on nodes and edges of a model. A D th -oidei Markov 
model has a tree- width of D, despite often containing only pairwise relationships. Similarly 
'skip-chain C RgvjSutton and McCalluiiJ . l200^ and junction-trees used in SLAM 

applications (Paskin, 20031 ) often contain only pairwise terms, and may have low tree width under 
reasonable conditions. In each case, if the tree-width is D, Algorithm |4] takes 0(MN D \fN) (for 
a model with M nodes and N states per node), yielding a speed-up of Q(y/N). 



Models for shape matching and pose reconstru ction often exhibit similar properties (jTresadern et all 
20091 : bonner et all . I2QQ7I ; ISigal and Blackl . l2006h . In each case, third-order cliques factorize into 
second order terms; hence we can apply Algorithm [3] to achieve a speed-up of tt(y/N). 



Another similar model for shape matching is that of iFelzenszwalbl (120051 ): this model again 
contains third-order cliques, though it includes a 'geometric' term constraining all three variables. 
Here, the third-order term is independent of the input data, meaning that each of its rows can 
be sorted offline, as described in Section [H In this case, those factors that depend upon the 
observation are pairwise, meaning that we achieve a speed-up of Q(N^). Further applications 
of this type shall be explored in Section [6721 

In iCoughlan and Ferreira (2002), deformable shape-matching is solved approximately using 
loopy belief-propagation. Their model has only second-order cliques, meaning that inference 
takes Q(MN 2 ) per iteration. Although we cannot improve upon this result, we note that we can 
typically do exact inference in a single iteration in 0(MN 2 \fN)] thus our model has the same 
running time as 0(y/~N) iteratio ns of the ori ginal version. This result applies to all second-order 
models containing; a sine: le loop (jWeissl . l20Q(T >. 



In lMcAuley et al.l (120081 ). a model is presented for graph-matching using loopy belief-propagation; 
the maximal cliques for D-dimensional matching have size (D + 1), meaning that inference takes 
@(MN D+1 ) per iteration (it is shown to converge to the correct solution); we improve this to 

o(mn d Vn). 



Interval graphs can be used to model resource allocation problems (IFulkerson and Grossl . 
19651 ): each node encodes a request, and overlapping requests form edges. Maximal cliques grow 
with the number of overlapping requests, though the constraints are only pairwise, meaning that 
we again achieve an Q(y/~N) improvement. 



Belief-propagati on can be used to solve LP -relaxations in pairwise graphical models. In 
Sontag et al. (l2008h . LP-relaxations are computed for pairwise models by constructing several 
third-order 'clusters', which compute pairwise messages for each of their edges. Again, an Q(y/~N) 
improvement is achieved. 

Finally, in Section 16.21 we shall explore a variety of applications in which we have pairwise 
models of the form shown in (eq. [T6l) . In all of these cases, we see an (expected) reduction of a 
message-passing algorithm to O(MNyfN). 

Table [3] summarizes these results. Reported running times reflect the expected case. Note 
that we are assuming that max-product belief -propagation is being used in a discrete model; some 
of the referenced articles may use different variants of the algorithm (e.g. Gaussian models, 
or approximate inference schemes). We believe that our improvements may revive the exact, 
discrete version as a tractable option in these cases. 
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Table 3: Some existing work to which our results can be immediately applied (M nodes, N 
states per node, cliques of size \C\. 'iter.' denotes that the algorithm is iterative). 



Reference 


description 


running time 


our method 


McAulev et al. (2008) 
Sutton and McCallum (2006) 
Gallev (2006) 

Paskin (2003) (discrete case) 
Tresadern et al. (2009) 
Coughlan and Ferreira (2002 ) 
Sigal and Black (2006) 
Felzenszwalb (2005) 
Fulkerson and Gross (1965) 
Sontag et al. (2008) 


.D-d graph-matching 
Width-D skip-chain 
Width-3 skip-chain 
SLAM, width D 
Deformable matching 
Deformable matching 
Pose reconstruction 
Deformable matching 
Width-D interval graph 
LP with M clusters 


S(MN D+1 ) (iter.) 
0(MN D ) 
G(MN 3 ) 
0(MN D ) 

e(MN 3 ) 

Q(MN ) (iter.) 
S(MN ) 

s(mn 3 ) 

0(MN D+1 ) 
G(MN 3 ) 


0(MN D VN) (iter.) 

o\mn d ~ 1 \/n) 

o(mn 2 Vn) 

oImn - 1 ^) 

o(mn 2 Vn) 

o(mn 2 Vn) 

o[mn 2 Vn) 

e(MJVt) (online) 
0(MN D ^N) 

o(mn 2 Vn) 



6 Experiments 

We present experimental results for two types of models: those whose cliques factorize into 
smaller terms, as discussed in Section^ and those whose factors that depend upon the observation 
contain fewer latent variables than their maximal cliques, as discussed in Section [H 

6.1 Experiments with Within-Clique Factorization 

In this section we present experiments in models whose cliques factorize into smaller terms, as 
discussed in Section El We also use this section to demonstrate Theorems [1] and [2] experimentally. 

6.1.1 Comparison Between Asymptotic Performance and Upper-Bounds 

For our first experiment, we compare the performance of Algorithms [2] and [5] to the naive 
solution of Algorithm [TJ These are core subroutines of each of the other algorithms, meaning 
that determining their performance shall give us an accurate indication of the improvements we 
expect to obtain in real graphical models. 

For each experiment, we generate N i.i.d. samples from [0, 1) to obtain the lists v\ . . .vk- 
N is the domain size; this may refer to a single node, or a group of nodes as in Algorithm [5j 
thus large values of N may appear even for binary-valued models. K is the number of lists in 
(eq. [T3lk we can observe this number of lists only if we are working in cliques of size K + 1, and 
then only if the factors are of size K (e.g. we will only see K — 5 if we have cliques of size 6 with 
factors of size 5); therefore smaller values of K are probably more realistic in practice (indeed, 
all of the applications in Section [5] have K — 2). 

The performance of our algorithm is shown in Figure [TUl for K — 2 to 4 (i.e., for 2 to 4 
lists). When K — 2, we execute Algorithm EJ while Algorithm [5] is executed for K > 3. The 
performance reported is simply the number of elements read from the lists (which is at most 
K x start). This is compared to N itself, which is the number of elements read by the naive 
algorithm. The upper-bounds we obtained in (eq. l38l) are also reported, while the true expected 
performance (i.e., (eq. l25l) ) is reported for K — 2. Note that the variable read was introduced 
into Algorithm [5] in order to guarantee that it can never be asymptotically slower than the naive 
algorithm. If this variable is ignored, the performance of our algorithm deteriorates to the point 
that it closely approaches the upper-bounds shown in Figure [THl Unfortunately, this optimization 
proved overly complicated to include in our analysis, meaning that our upper-bounds remain 
highly conservative for large K. 
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Performance and bounds for 2 lists Performance and bounds for 3 lists Performance and bounds for 4 lists 




Figure 10: Performance of our algorithm and bounds. For K — 2, the exact expectation is 
shown, which appears to precisely match the average performance (over 100 trials). The dotted 
lines show the bound of (eq. [35]). While the bound is close to the true performance for K = 2, 
it becomes increasingly loose for larger K. 

6.1.2 Performance Improvement for Dependent Variables 

The expected-case running time of our algorithm was derived under the assumption that each 
list has independent order statistics, as was the case for our previous experiment. We suggested 
that we will obtain worse performance in the case of negatively correlated variables, and better 
performance in the case of positively correlated variables; we shall assess these claims in this 
experiment. 

Figure [TT] shows how the order-statistics of v a and can affect the performance of our algo- 
rithm. Essentially, the running time of Algorithm [2] is determined by the level of 'diagonalness' 
of the permutation matrices in Figure [TTJ highly diagonal matrices result in better performance 
than the expected case, while highly off-diagonal matrices result in worse performance. The 
expected case was simply obtained under the assumption that every permutation is equally 
likely. 

We report the performance for two lists (i.e., for Algorithm [2]) , whose values are sampled 
from a 2-dimensional Gaussian, with covariance matrix 



1 c 
c 1 



(18) 



meaning that the two lists are correlated with correlation coefficient c. In the case of Gaussian 
random variables, the correlation coefficient precisely captures the 'diagonalness' of the matrices 
in Figure [TTJ Performance is shown in Figure [12] for different values of c (c = 0, is not shown, 
as this is the case observed in the previous experiment). 

6.1.3 2-Dimensional Graph Matching 

Naturally, Algorithm [4] has additional overhead compared to the naive solution, meaning that it 
will not be beneficial for small N. In this experiment, we aim to assess the extent to which ou r 



approach is useful in real applications. We reproduce the model from iMcAuley et all (l200£ 
which performs 2-dimensional graph matching, using a loopy graph with cliques of size three, 
co ntaining; only se c ond o rder potentials (as described in Section [5]); the 0(7VM 3 ) performance 



of IMcAuley et al.l ((2008) is reportedly state-of-the-art. We also show the performance on a 
graphical model with random potentials, in order to assess how the results of the previous 
experiments are reflected in terms of actual running time. 

We perform matching between a template graph with M nodes, and a targe t graph with N 



nodes, which requires a graphical model with M nodes and N states per node (see IMcAuley et al. 
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Figure 11: Different permutation matrices and their resulting cost (in terms of entries 
read/multiplications performed). Each permutation matrix transforms the sorted values of one 
list into the sorted values of the other, i.e., it transforms v a as sorted by p a into V5 as sorted by 
Pb. The red squares show the entries that must be read before the algorithm terminates (each 
corresponding to one multiplication). See Figure [21] for further explanation. 




Figure 12: Performance of our algorithm for different correlation coefficients. The top three plots 
show positive correlation, the bottom three show negative correlation. Correlation coefficients 
of c = 1.0 and c = —1.0 capture precisely the best and worst-case performance of our algorithm, 
resulting in O(l) and Q(N) performance (respectively). 
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Random potentials (5 iterations) 2D Graph matching 




N (number of states) N (size of target graph) 



Figure 13: The running time of our method on randomly generated potentials, and on a graph 
matching experiment (both graphs have the same topology). Fitted curves are also obtained 
by performing least-squares regression; the residual error r indicates the 'goodness' of the fitted 
curve. 



(2008) for details). We fix M = 10 and vary N. Performance is shown in Figure [131 Fitted curves 
are shown together with the actual running time of our algorithm, confirming its 0{MN 2 \fN) 
performance. The coefficients of the fitted curves demonstrate that our algorithm is useful even 
for modest values of N. 



We also report results for graph matching using graphs from the MPEG-7 dataset (IBai et al. 



which consists of 1,400 silhouette images. Again we fix M = 10 (i.e., 10 points are 



extracted in each template graph) and vary N (the number of points in the target graph). This 
experiment confirms that even when matching real- world graphs, the assumption of independent 
order-statistics appears to be reasonable. 



6.1.4 Higher-Order Markov Models 

In this experiment, we construct a simple Markov model for text-denoising. Random noise is 
applied to a text segment, which we try to correct using a prior extracted from a text corpus. 
For instance 



wondrous sight of th4 ivory Pequod is corrected to wondrous sight of the ivory 

Pequod. 

In such a model, we would like to exploit higher-order relationships between characters, 
though the amount of data required to construct an accurate prior grows exponentially with 
the size of the maximal cliques. Instead, our prior consists entirely of pairwise relationships 
between characters (or 'bigrams'); higher-order relationships are encoded by including bigrams 
of non-adjacent characters. Specifically, our model takes the form 

1*1-1 1*1-2 

*x(xj) = Yl ^M+ifo'^+i) x II ^+2(^,^+2) (19) 
1=1 1=1 

where 

$ij(xi,Xj) = i/>ij(xi,Xj)p(xi\oi)p(xj\oj). (20) 
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2D Graph matching (MPEG-7 data) 



140 




N (size of target graph) 



Figure 14: The running time of method our on graphs from the MPEG-7 dataset. 




Figure 15: Left: Our model for denoising. Its computational complexity is similar to that of a 
skip-chain CRF, and models for named-entity recognition (right). 

Here ^ is our prior (extracted from text statistics), and p is our 'noise model' (given the 
observation o). The computational complexity of inference in this model is similar to that of the 
skip-chain CRF shown in Figure E](b), as well as models for part-of-speech tagging and named- 
entity recognition, as in Figure [T5j Text denoising is useful for the purpose of demonstrating 
our algorithm, as there are several different corpora available in different languages, allowing 
us to explore the effect that the domain size (i.e., the size of the language's alphabet) has on 
running time. 

We extracted pairwise statistics based on 10,000 characters of text, and used this to correct a 
series of 25 character sequences, with 1% random noise introduced to the text. The domain was 
simply the set of characters observed in each corpus. The Japanese dataset was not included, 
as the ®(MN 2 ) memory requirements of the algorithm made it infeasible with N ~ 2000; this 
is addressed in Section 16.2.11 

The running time of our method, compared to the naive solution, is shown in Figure [TBI One 
might expect that texts from different languages would exhibit different dependence structures 
in their order statistics, and therefore deviate from expected case in some instances. However, 
the running times appear to follow the fitted curve closely, i.e., we are achieving approximately 
the expected-case performance in all cases. 

Since the prior ^2,2+1(^5 x i+i) is data-independent, we shall further discuss this type of model 
in reference to Algorithm [6] in Section 16.21 



6.1.5 Protein Design 



In ISontag et al.l (120081 ) , a method is given for exact MAP-inference in graphical models using 
LP-relaxations. Where exact solutions cannot be obtained by considering only pairwise factors, 
'clusters' of pairwise terms are introduced in order to refine the solution. Message-passing in 
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Figure 16: The running time of our method compared to the naive solution. A fitted curve is 
also shown, whose coefficient estimates the computational overhead of our model. 



these clusters turns out to take exactly the form that we consider, as third-order (or larger) 
clusters are formed from pairwise terms. Although a number of applications are presented in 
ISontag et al. (120081 ). we focus on protein design, as this is the application in which we typically 
observe the largest domain sizes. Other applications with larger domains may yield further 
benefits. 

Without going into detail, we simply copy the two equations in lSontag et al. to which 

our algorithm applies. The first of these is concerned with passing messages between clusters, 
while the second is concer ned with choo s ing; n ew clusters to add. Below are the two equations, 
reproduced verbatim from Sontag et al. (j2C 



■(A e ^ e (x e )+ ^2 A c /^ e (x e )) + -max ^ (A e /^ e /(x e /) + 

O x c\e L . 



c\e 



c / /c,e / Ec' 



(av)) 
(21) 

(see ISontag et all 120081 . Figure 1, bottom), which consists of marginalizing a cluster (c) that 
decomposes into edges (e), and 



d(c) 



max6 e (x e ) — max 



£ 

,eEc 



b e (x e 



(22) 



(see Sontag et al. , 20081 , (eq. 4)), which consists of finding the MAP state in a ring-structured 
model. 

As the code from ISontag et all ( 20081 ) was publicly available, we simply replaced the ap- 
propriate functions with our own (in order to provide a fair comparison, we also replaced their 
implementation of the naive algorithm, as ours proved to be faster than the highly generic matrix 
library used in their code). 

In order to improve the running time of our algorithm, we made the following two modifica- 
tions to Algorithm [2j 

• We used an adaptive sorting algorithm (i.e., a sorting algorithm that runs faster on nearly- 
sorted data). While quicksort was used during the first iteration of message-passing, sub- 
sequent iterations used insertion sort, as the optimal ordering did not change significantly 
between iterations. 
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Figure 17: The running time of our method on protein design problems from ISontag et al 

tosh . 

• We added an additional stopping criterion to the algorithm. Namely, we terminate the 
algorithm if v a [Pa [start]] x v^p^start]] < max. In other words, we check how large the 
maximum could be given the best possible permutation of the next elements (i.e., if they 
have the same index); if this value could not result in a new maximum, the algorithm ter- 
minates. This check costs us an additional multiplication, but it means that the algorithm 
will terminate faster in cases where a large maximum is found early on. 

Results for these two problems ar e shown in Fi g ure [T71 Although our algorithm consistently 



improves upon the running time of ISontag et al.1 (120081 ) , the domain size of the variables in 



question is not typically large enough to see a marked improvement. Interestingly, neither 
method follows the expected running time closely in this experiment. This is partly due to the 
fact that there is significant variation in the variable size (note that N only shows the average 
variable size), but it may also suggest that there is a complicated structure in the potentials 
which violates our assumption of independent order statistics. 

6.2 Experiments with Data-Independent Factors 

In each of the following experiments we perform belief-propagation in models of the form given 
in (eq. [T6l) . Thus each model is completely specified by defining the node potentials &i(xi\yi), 
the edge potentials $^j(x^, xj), and the topology (Af,£) of the graph. 

Furthermore we assume that the edge potentials are homogeneous, i.e., that the potential 
for each edge is the same, or rather that they have the same order statistics (for example, they 
may differ by a multiplicative constant). This means that sorting can be done online without 
affecting the asymptotic complexity. When subject to heterogeneous potentials we need merely 
sort them offline; the online cost shall be similar to what we report here. 

6.2.1 Chain-Structured Models 

In this section, we consider chain- structured graphs. Here we have nodes Af = {1 . . . Q}, and 
edges £ = {(1, 2), (2, 3) . . . (Q — 1, Q)}. The max-product algorithm is known to compute the 
maximum-likelihood solution exactly for tree-structured models. 

Figure [18] (left) shows the performance of our method on a model with random potentials, 
i.e., §i(xi\yi) = f7[0, 1), <E>^ + i(a^, x^ + i) = J7[0, 1), where U[0, 1) is the uniform distribution. 
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Figure 18: Running time of inference in chain-structured models: random potentials (left), and 
text denoising (right). Fitted curves confirm that the exponent of our method is indeed 1.5 (r 
denotes the sum of residuals, i.e., the 'goodness' of the fitted curve). 



Fitted curves are superimposed onto the running time, confirming that the performance of the 
standard solution grows quadraticalfy with the number of states, while ours grows at a rate of 
Ny/N. The residual error r shows how closely the fitted curve approximates the running time; 
in the case of random potentials, both curves have almost the same constant. 

Figure [18] (right) shows the performance of our method on the text-denoising experiment. 
This experiment is essentially identical to that shown in Section [6.1. 4[ except that the model is 
a chain (i.e., there is no $^+2), and we exploit the notion of data-independence (i.e., the fact 
that $2,2+1 does not depend on the observation). Since the same $2,2+1 is used for every adjacent 
pair of nodes, there is no need to perform the 'sorting' step offline - only a single copy of $2,2+1 
needs to be sorted, and this is included in the total running time shown in Figure ITSl 



6.2.2 Grid-Structured Models 

Similarly, we can apply our method to grid- structured models. Here we resort to loopy belief- 
propagation to appro ximate the MAP solution , though indeed the same analysis applies in the 



case of factor graphs ( Kschischang et alll200ll ). We construct a 50 x 50 grid model and perform 



loopy belief-propagation using a random message-passing schedule for five iterations. In these 
experiments our nodes are M = {1 . . . 50} 2 , and our edges connect the 4-neighbors, i.e., the node 
(i, j) is connected to both (i + 1, j) and (z, j + 1) (similar to the grid shown in Figure [21(a)). 

Figure [19] (left) shows the performance of our method on a grid with random potentials 
(similar to the experiment in S ection I6.2.1D. Figu r e [T9l (right) shows the performance of our 
method on an optical flow task flLucas and Kanadel . Il98ll ). Here the states encode flow vectors: 
for a node with N states, the flow vector is assumed to take integer coordinates in the square 
[— \fN /2, y/~N /2) 2 (so that there are N possible flow vectors). For the unary potential we have 

*(ij)(x\y) = \\lm 1 [i,j}-Im 2 [(i,j) + f(x)}\\, (23) 

where lm\ [a, b] and Irri2[a,b] return the gray-level of the pixel at (a, b) in the first and second 
images (respectively), and f(x) returns the flow vector encoded by x. The pairwise potentials 
simply encode the Euclidean distance between two flow vectors. Note that a variety of low- level 



computer vision tasks (including optical flow) are studied in iFelzenszwalb and Huttenlocher 
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Figure 19: Running time of inference in grid-structured models: random potentials (left), and 
optical flow (right). 



(2006), where the highly structured nature of the potentials in question often allows for efficient 
solutions. 

Our fitted curves in Figure fT9l show 0(Ny/~N) performance for both random data and for 
optical flow. 



6.2.3 Failure Cases 



In our previous experiments on graph-matching, text denoising, and optical flow we observed 
running times similar to those for random potentials, indicating that there is no prevalent 
dependence structure between the order statistics of the messages and the potentials. 

In certain applications the order statistics of these terms are highly dependent. The most 
straightforward example is that of concave potentials (or convex potentials in a min-sum for- 
mulation). For instance, in a stereo disparity experiment, the unary potentials encode the fact 
that the output should be 'close to' a certain value; the pairwise potentials encode the fact th at 



neighboring nodes should take similar values (IScharstein and Szeliski . 2001 : Sun et al. . 20031 ) . 



Whenever both v a and are concave in (eq. [T]), the permutation matrix that transforms the 
sorted values of v a to the sorted values of is block-off-diagonal (see the sixth permutation in 
Figure [TT1) . In such cases, our algorithm only decreases the number of multiplication operations 
by a multiplicative constant, and may in fact be slower due to its computational overhead. This 
is precisely the behavior shown in Figure [20] (left), in the case of stereo disparity. 

It should be noted that there exist algori t hms specifically designed for this clas s of poten- 
tial functions (jKolmogorov and Shioura . 2007 : Felzenszwalb and Huttenlocher . 20061 ) . which are 
preferable in such instances. 

We similarly perform an ex periment on image d enois ing, where the una ry p otent ials are again 
convex functions of the input iGeman and Gema lLan et al.l (see|2006). Instead of 

using a pairwise potential that merely encodes smoothness, we extract the pairwise statistics 
from image data (similar to our experiment on text denoising); thus the potentials are no 
longer concave. We see in Figure [20] (right) that even if a small number of entries exhibit some 
'randomness' in their order statistics, we begin to gain a modest speed improvement over the 
naive solution (though indeed, the improvements are negligible compared to those shown in 
previous experiments). 
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Stereo disparity (50 x 50 grid, 5 iterations) Image denoising (50 x 50 grid, 5 iterations) 




N (number of states) N (number of states) 



Figure 20: Two experiments whose potentials and messages have highly dependent order statis- 
tics: stereo disparity (left), and image denoising (right). 



7 Discussion and Future Work 

As we touched upon briefly in Section [21 there are a variety of applications of our algorithm 
beyond graphical models - what we have in fact presented is a solution to the problem of funny 
matrix mu l tiplic ation, which generalizes to matrices of arbitrary dimension. For instance, in 



Aho et ajj (J1983) a transformation is given between funny matrix multiplication and all-pairs 
shortest path, meaning that our alg orithm results in a sub-cubic solution to this problem. While 
the fastest known solution (due to Karger et al. . 19931 ) has running time 0{N 2 log N) (subject to 



certain assumptions on the input graph), its implementation requires a Fibonacci heap, meaning 
that our algorithm proves to be faster for reasonable values of N. 

It is interesting to consider the fact that our algorithm's running time is purely a function 
of the input data's order statistics, and in fact does not depend on the data itself. While it is 
pleasing that our assumption of independent order statistics appears to be a weak one, and is 
satisfied in a wide variety of applications, it ignores the fact that stronger assumptions may be 
reasonable in many cases. In factors with a high dynamic range, or when different factors have 
different scales, it may be possible to identify the maximum value very quickly, as we attempted 
to do in Section 16.1.51 Deriving faster algorithms that make stronger assumptions about the 
input data remains a promising avenue for future work. 

Our algorithm may also lead to faster solutions for approximate inference in graphical models. 
While the stopping criterion of our algorithm guarantees that the maximum value is found, it 
is possible to terminate the algorithm earlier and state that the maximum has probably been 
found. A direction for future work would be to adapt our algorithm to determine the probability 
that the maximum has been found after a certain number of steps; we could then allow the user 
to specify an error probability, or a desired running time, and our algorithm could be adapted 
accordingly. 



8 Conclusion 

We have presented a series of approaches that allow us to improve the performance of exact and 
approximate max-product message-passing for models with factors smaller than their maximal 
cliques, and more generally, for models whose factors that depend upon the observation contain 



28 



fewer latent variables than their maximal cliques. We are always able to improve the expected 
computational complexity in any model that exhibits this type of factorization, no matter the 
size or number of factors. Our improvements increase the class of problems for which inference 
via max-product belief-propagation is a tractable option. 
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A Asymptotic Performance of Algorithm [2] and Extensions 



In this section we shall determine the expected case running times of Algorithm [2] and Algorithm 
[5j Algorithm E] traverses v a and V5 until it reaches the smallest value of rn for which there is 
some j < m for which m > p^lpalj]]- If M is a random variable representing this smallest value 
of m, then we wish to find E(M). While E(M) is the number of 'steps' the algorithms take, 
each step takes Q(K) when we have K lists. Thus the expected running time is ®(KE(M)). 

To aid understanding our algorithm, we show the elements being read for specific examples of 
v a and V5 in Figure [2TJ This figure reveals that the actual values in v a and V5 are unimportant, 
and it is only the order-statistics of the two lists that determine the performance of our algorithm. 
By representing a permutation of the digits 1 to N as shown in Figure [22] ((a), (b), and (d)), 
we observe that m is simply the width of the smallest square (expanding from the top left) that 
includes an element of the permutation (i.e., it includes i and p[i\). 
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Figure 21: (a) The lists v a and V5 before sorting; (b) Black squares show corresponding elements 
in the sorted lists (v a [p a [i]] and v&[pb[i]]); red squares indicate the elements read during each 
step of the algorithm (y a [p a [start]] and v&[p& [start]]). We can imagine expanding a gray box of 
size start x start until it contains an entry; note that the maximum is found during the first 
step. 



Simple analysis reveals that the probability of choosing a permutation that does not contain 
a value inside a square of size m is 



P(M > m) 



(N — m)\(N — m)\ 
(iV-2ra)!M 



(24) 



This is precisely 1 — F(m), where F(rn) is the cumulative density function of M . It is immediately 
clear that 1 < M < [N/2\, which defines the best and worst-case performance of Algorithm El 
Using the identity E(X) = J2T=i P{X > x), we can write down a formula for the expected 
value of M: 

[N/2i (N - rn)\(N - rn)\ 



E(M) = J2 



m=0 



(N - 2m)\N\ 



(25) 



The case where we are sampling from multiple permutations simultaneously (i.e., Algorithm 
EI) is analogous. We consider K — 1 permutations embedded in a K-dimensional hypercube, and 
we wish to find the width of the smallest shaded hypercube that includes exactly one element 
of the permutations (i.e., i,pi[i], . . . ,pK-i[i])- This is represented in Figure l22T c) for K — 3. 
Note carefully that K is the number of lists in (eq. fl3|) : if we have K lists, we require K — 1 
permutations to define a correspondence between them. 

Unfortunately, the probability that there is no non-zero entry in a cube of size m K is not 
trivial to compute. It is possible to write down an expression that generalizes (eq. 1211) . such as 



P K (M > m) 



N \K- 



E 



A 



max (7k(i) > m 
ke{i...K-i} v ' 



(26) 
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(a) (b) (c) (d) 



Figure 22: (a) As noted in Figure EH a permutation can be represented as an array, where 
there is exactly one non-zero entry in each row and column; (b) We want to find the smallest 
value of m such that the grey box includes a non-zero entry; (c) A pair of permutations can be 
thought of as a cube, where every two-dimensional plane contains exactly one non-zero entry; 
we are now searching for the smallest grey cube that includes a non-zero entry; the faces show 
the projections of the points onto the exterior of the cube (the third face is determined by the 
first two); (d) For the sake of establishing an upper-bound, we consider a shaded region of width 
f(N) and height m. 



(in which we simply enumerate over all possible permutations and 'count' which of them do not 
fall within a hypercube of size m K ), and therefore state that 

oo 

E K {M) = J2 pK ( M > m )- ( 27 ) 

m=0 

However, it is very hard to draw any conclusions from (eq. [26]) , and in fact it is intractable even 
to evaluate it for large values of N and K. Hence we shall instead focus our attention on finding 
an upper-bound on (eq. [2?]) . Finding more computationally convenient expressions for (eq. [2Sj) 
and (eq. 127]) remains as future work. 

A.l An Upper-Bound on E K (M) 

Although (eq. [25]) and (eq. [27]) precisely define the running times of Algorithm E] and Algorithm 
El it is not easy to ascertain the speed improvements they achieve, as the values to which the 
summations converge for large N are not obvious. Here, we shall try to obtain an upper-bound 
on their performance, which we assessed experimentally in Section [61 In doing so we shall prove 
Theorems [T] and El 

Proof of Theorem^ (see Algorithm Ej) Consider the shaded region in Figure [221( d) . This region 
has a width of f(N), and its height m is chosen such that it contains precisely one non-zero 
entry. Let M be a random variable representing the height of the grey region needed in order 
to include a non-zero entry. We note that 

E(M) G 0(f(N)) => E{M) G 0(f(N)); (28) 

our aim is to find the smallest f(N) such that E(M) E 0(f(N)). The probability that none of 
the first m samples appear in the shaded region is 

P(M > m) = [J (l - ^) • (29) 

4=0 ^ ' 
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Next we observe that if the entries in our N x N grid do not define a permutation, but we 
instead choose a random entry in each row, then the probability (now for M) becomes 

P(M>m)= (l-t^iy (30) 

(for simplicity we allow m to take arbitrarily large values). We certainly have that P(M > m) > 
P(M > m), meaning that E(M) is an upper-bound on E(M), and therefore on E(M). Thus 
we compute the expected value 

E(M) = ±(l-^p-Y. (31) 



This is just a geometric progression, which sums to N/f(N). Thus we need to find f(N) such 
that 

<32> 



Clearly f(N) E 0{y/N) will do. Thus we conclude that 

E{M) e 0(VN). (33) 

□ 

Proof of Theorem \M (see Algorithm [5j) We would like to apply the same reasoning in the case 
of multiple permutations in order to compute a bound on E K (M). That is, we would like to 
consider K — 1 random samples of the digits from 1 to iV, rather than K — 1 permutations, as 
random samples are easier to work with in practice. 

To do so, we begin with some simple corollaries regarding our previous results. We have 
shown that in a permutation of length TV, we expect to see a value less than or equal to / after 
N/f steps. There are now f — I other values that are less than or equal to / amongst the 
remaining N — N/f values; we note that 

f - 1 f 



N-f N 

Hence we expect to see the next value less than or equal to / in the next N/f steps also. A 
consequence of this fact is that we not only expect to see the first value less than or equal to 
/ earlier in a permutation than in a random sample, but that when we sample m elements, we 
expect more of them to be less than or equal to / in a permutation than in a random sample. 

Furthermore, when considering the maximum of K — 1 permutations, we expect the first rn 
elements to contain more values less than or equal to / than the maximum of K — 1 random 
samples, (eq. l26l) is concerned with precisely this problem. Therefore, when working in a K- 
dimensional hypercube, we can consider K — 1 random samples rather than K — 1 permutations 
in order to obtain an upper-bound on (eq. l27l) . 

Thus we define M as in (eq. [30]) , and conclude that 

P(M > m) = (l - /( ^gf~ 1 ) m - (35) 

Thus the expected value of M is again a geometric progression, which this time sums to 
(N/f(N, K)) K_1 . Thus we need to find f(N, K) such that 

/ / N x 
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Clearly 

f(N,K) GO(iV^ i ) (37) 

will do. As mentioned, each step takes O(i^), so the final running time is 0(KN^^). □ 

To summarize, for problems decomposable into K + 1 groups, we will need to find the index 
that chooses the maximal product amongst K lists; we have shown an upper-bound on the 
expected number of steps this takes, namely 

E K {M) e O (N^) • (38) 
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